Building a song recommender

Fire up GraphLab Create



In [1]:

    
import graphlab

Load music data



In [2]:

    
song_data = graphlab.SFrame('song_data.gl/')









    



[INFO] GraphLab Server Version: 1.6.1

Explore data

Each row of the music data records how many times a user listened to a song, along with the song's details.



In [3]:

    
song_data.head()









    Out[3]:





    
user_id	song_id	listen_count	title	artist	song
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...	SOAKIMP12A8C130995	1	The Cove	Jack Johnson	The Cove - Jack Johnson
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...	SOBBMDR12A8C13253B	2	Entre Dos Aguas	Paco De Lucia	Entre Dos Aguas - Paco De Lucia ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...	SOBXHDL12A81C204C0	1	Stronger	Kanye West	Stronger - Kanye West
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...	SOBYHAJ12A6701BF1D	1	Constellations	Jack Johnson	Constellations - Jack Johnson ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...	SODACBL12A8C13C273	1	Learn To Fly	Foo Fighters	Learn To Fly - Foo Fighters ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...	SODDNQT12A6D4F5F7E	5	Apuesta Por El Rock 'N' Roll ...	Héroes del Silencio	Apuesta Por El Rock 'N' Roll - Héroes del ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...	SODXRTY12AB0180F3B	1	Paper Gangsta	Lady GaGa	Paper Gangsta - Lady GaGa
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...	SOFGUAY12AB017B0A8	1	Stacked Actors	Foo Fighters	Stacked Actors - Foo Fighters ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...	SOFRQTD12A81C233C0	1	Sehr kosmisch	Harmonia	Sehr kosmisch - Harmonia
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...	SOHQWYZ12A6D4FA701	1	Heaven's gonna burn your eyes ...	Thievery Corporation feat. Emiliana Torrini ...	Heaven's gonna burn your eyes - Thievery ...

[10 rows x 6 columns]

Showing the most popular songs in the dataset



In [4]:

    
graphlab.canvas.set_target('ipynb')



In [5]:

    
song_data['song'].show()



In [6]:

    
len(song_data)









    Out[6]:





1116609

Count the number of unique users in the dataset



In [7]:

    
users = song_data['user_id'].unique()



In [8]:

    
len(users)









    Out[8]:





66346

Q1: Find the artist with the most unique listeners



In [20]:

    
kanye_songs = song_data[song_data['artist'] == 'Kanye West']
foo_songs = song_data[song_data['artist'] == 'Foo Fighters']
swift_songs = song_data[song_data['artist'] == 'Taylor Swift']
gaga_songs = song_data[song_data['artist'] == 'Lady GaGa']



In [21]:

    
kanye_users = kanye_songs['user_id'].unique()
foo_users = foo_songs['user_id'].unique()
swift_users = swift_songs['user_id'].unique()
gaga_users = gaga_songs['user_id'].unique()



In [25]:

    
print "{} {} {} {}".format(len(kanye_users), len(foo_users), len(swift_users), len(gaga_users))









    



2522 2055 3246 2928
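
The four filters above compare a hand-picked set of artists. As a rough illustration of the same idea applied to every artist at once, the sketch below counts distinct users per artist in plain Python. It does not use the SFrame API; the `rows` list and the `unique_listeners` helper are hypothetical names on made-up data.

```python
from collections import defaultdict

# Hypothetical toy rows in the shape (user_id, artist); not taken from song_data.
rows = [
    ("u1", "Kanye West"),
    ("u1", "Kanye West"),   # a repeat listen by the same user counts once
    ("u2", "Kanye West"),
    ("u2", "Taylor Swift"),
    ("u3", "Taylor Swift"),
    ("u4", "Taylor Swift"),
]

def unique_listeners(rows):
    """Map each artist to the number of distinct users who played them."""
    users_by_artist = defaultdict(set)
    for user_id, artist in rows:
        users_by_artist[artist].add(user_id)
    return {artist: len(users) for artist, users in users_by_artist.items()}

counts = unique_listeners(rows)
top_artist = max(counts, key=counts.get)  # artist with the most distinct users
```

In the notebook output above, the counts are printed in the order Kanye West, Foo Fighters, Taylor Swift, Lady GaGa, so Taylor Swift (3246) has the most unique listeners of the four.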

Q2: Find the artist with the highest total listen count



In [29]:

    
listener_count = song_data.groupby(key_columns='artist', operations={'total_count': graphlab.aggregate.SUM('listen_count')})



In [33]:

    
listener_count = listener_count.sort('total_count', ascending=False)
listener_count.head()









    Out[33]:





    
artist	total_count
Kings Of Leon	43218
Dwight Yoakam	40619
Björk	38889
Coldplay	35362
Florence + The Machine	33387
Justin Bieber	29715
Alliance Ethnik	26689
OneRepublic	25754
Train	25402
The Black Keys	22184
    

[10 rows x 2 columns]
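
To make the aggregation concrete, here is a minimal plain-Python sketch of a group-by-artist sum followed by a descending sort. It only mirrors the idea behind the `groupby` and `sort` calls above; the `plays` rows and the `total_listens` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy rows in the shape (artist, listen_count).
plays = [
    ("Coldplay", 3),
    ("Train", 1),
    ("Coldplay", 2),
    ("Train", 3),
    ("OneRepublic", 1),
]

def total_listens(plays):
    """Sum listen_count per artist, like a group-by with a SUM aggregate."""
    totals = defaultdict(int)
    for artist, listen_count in plays:
        totals[artist] += listen_count
    return dict(totals)

totals = total_listens(plays)
# Highest total first, matching ascending=False in the notebook.
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
```

Note that this totals plays, not people: one user who plays an artist 50 times contributes 50 to the sum.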

Q3: Find the artist with the lowest total listen count



In [34]:

    
listener_count = listener_count.sort('total_count', ascending=True)
listener_count.head()









    Out[34]:





    
artist	total_count
William Tabbert	14
Reel Feelings	24
Beyoncé feat. Bun B and Slim Thug ...	26
Boggle Karaoke	30
Diplo	30
harvey summers	31
Nâdiya	36
Aneta Langerova	38
Jody Bernal	38
Kanye West / Talib Kweli / Q-Tip / Common / ...	38
    

[10 rows x 2 columns]

Create a song recommender






In [9]:

    
train_data,test_data = song_data.random_split(.8,seed=0)
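
The call above holds out roughly 20% of the rows for testing, and the fixed seed makes the split repeatable. The sketch below illustrates that behavior in plain Python, assuming each row is assigned independently at random. It is not GraphLab's actual sampling code, and this `random_split` function is a hypothetical stand-in for the SFrame method.

```python
import random

def random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # stand-in for 1000 observations
train, test = random_split(rows, 0.8, seed=0)
train_again, test_again = random_split(rows, 0.8, seed=0)  # same seed, same split
```

A per-row random split yields approximately, not exactly, the requested fraction. That is consistent with the training logs in this notebook, which report 893580 of the 1116609 rows (about 80%) in the training set.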

Simple popularity-based recommender



In [10]:

    
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                         user_id='user_id',
                                                         item_id='song')









    



PROGRESS: Recsys training: model = popularity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 0.763742s
PROGRESS: 893580 observations to process; with 9952 unique items.

Use the popularity model to make some predictions

A popularity model makes the same prediction for all users, so it provides no personalization.
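
To show why a popularity model cannot personalize, here is a minimal sketch that ranks songs by how many interactions they have and returns that one ranking to everyone. The toy `interactions` data and both helper functions are hypothetical, and this is not how GraphLab implements its popularity recommender.

```python
from collections import Counter

# Hypothetical toy interactions in the shape (user_id, song).
interactions = [
    ("u1", "song A"), ("u2", "song A"), ("u3", "song A"),
    ("u1", "song B"), ("u2", "song B"),
    ("u3", "song C"),
]

def popularity_ranking(interactions, k):
    """Return the k songs with the most interactions, most popular first."""
    counts = Counter(song for _, song in interactions)
    return [song for song, _ in counts.most_common(k)]

def recommend(user_id, interactions, k=2):
    # user_id is deliberately unused: every user gets the same list.
    return popularity_ranking(interactions, k)

recs_u1 = recommend("u1", interactions)
recs_u3 = recommend("u3", interactions)
```

A production recommender may additionally filter out songs a user has already played, which this sketch skips.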



In [ ]:

    
popularity_model.recommend(users=[users[0]])



In [ ]:

    
popularity_model.recommend(users=[users[1]])

Build a song recommender with personalization

We now create a model that allows us to make personalized recommendations to each user.



In [ ]:

    
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')
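
As a rough picture of what an item-similarity recommender does, the sketch below scores each unplayed song by its average Jaccard similarity to the songs a user has already played. Jaccard similarity over listener sets is an assumption made for illustration, as are the toy data and every function name; the notebook does not state which similarity measure or scoring rule GraphLab uses.

```python
from collections import defaultdict

# Hypothetical toy interactions in the shape (user_id, song).
interactions = [
    ("u1", "song A"), ("u1", "song B"),
    ("u2", "song A"), ("u2", "song B"), ("u2", "song C"),
    ("u3", "song C"), ("u3", "song D"),
]

def listeners_by_song(interactions):
    """Map each song to the set of users who played it."""
    listeners = defaultdict(set)
    for user_id, song in interactions:
        listeners[song].add(user_id)
    return listeners

def jaccard(a, b):
    """Shared listeners divided by all listeners of either song."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def recommend(user_id, interactions, k=1):
    listeners = listeners_by_song(interactions)
    played = set(song for user, song in interactions if user == user_id)
    scores = {}
    for candidate in listeners:
        if candidate in played:
            continue  # only recommend songs the user has not played
        sims = [jaccard(listeners[candidate], listeners[p]) for p in played]
        scores[candidate] = sum(sims) / len(sims) if sims else 0.0
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return ranked[:k]

recs_u1 = recommend("u1", interactions)
recs_u3 = recommend("u3", interactions)
```

Because the scores depend on each user's own listening history, `u1` and `u3` receive different recommendations here, unlike with the popularity model.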

Applying the personalized model to make song recommendations

As you can see, different users get different recommendations now.



In [ ]:

    
personalized_model.recommend(users=[users[0]])



In [ ]:

    
personalized_model.recommend(users=[users[1]])

We can also use the model to find songs similar to any song in the dataset.



In [ ]:

    
personalized_model.get_similar_items(['With Or Without You - U2'])



In [ ]:

    
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])

Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves.



In [ ]:

    
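from distutils.version import LooseVersion

# Compare parsed versions; slicing the version string would misorder e.g. "1.10" and "1.6".
if LooseVersion(graphlab.version) >= LooseVersion("1.6"):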
    model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
    graphlab.show_comparison(model_performance,[popularity_model, personalized_model])
else:
    %matplotlib inline
    model_performance = graphlab.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=.05)

The precision-recall curve shows that the personalized model performs much better than the popularity model.
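
For readers unfamiliar with the metric, the sketch below computes precision and recall at several cutoffs k for one user, using the standard definitions. How `graphlab.compare` computes and averages these values internally is not shown in this notebook, so treat the helper and the toy lists as hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(len(top_k)) if top_k else 0.0
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations and held-out (test) songs for one user.
recommended = ["song A", "song B", "song C", "song D"]
relevant = ["song B", "song D", "song E"]

# One (precision, recall) point per cutoff; plotting them gives the curve.
curve = [precision_recall_at_k(recommended, relevant, k) for k in range(1, 5)]
```

Precision is the share of recommended songs the user actually listened to, and recall is the share of the user's held-out songs that were recommended. A better model has a curve that lies higher at every recall level.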

Q4: Find the song most often recommended to the first 10,000 test users



In [35]:

    
train_data,test_data = song_data.random_split(.8,seed=0)



In [39]:

    
item_similarity_model = graphlab.item_similarity_recommender.create(train_data,
                                                             user_id='user_id',
                                                             item_id='song')









    



PROGRESS: Recsys training: model = item_similarity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 0.804962s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 9952 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 1.46503         |
PROGRESS: | 2000            | 1.49916         |
PROGRESS: | 3000            | 1.5328          |
PROGRESS: | 4000            | 1.56653         |
PROGRESS: | 5000            | 1.59924         |
PROGRESS: | 6000            | 1.63177         |
PROGRESS: | 7000            | 1.66651         |
PROGRESS: | 8000            | 1.70576         |
PROGRESS: | 9000            | 1.75355         |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 2.10749s



In [40]:

    
subset_test_users = test_data['user_id'].unique()[0:10000]



In [43]:

    
rec_songs = item_similarity_model.recommend(subset_test_users,k=1)









    



PROGRESS: recommendations finished on 1000/10000 queries. users per second: 1567.02
PROGRESS: recommendations finished on 2000/10000 queries. users per second: 1514.73
PROGRESS: recommendations finished on 3000/10000 queries. users per second: 1550.84
PROGRESS: recommendations finished on 4000/10000 queries. users per second: 1567.67
PROGRESS: recommendations finished on 5000/10000 queries. users per second: 1573.07
PROGRESS: recommendations finished on 6000/10000 queries. users per second: 1581.12
PROGRESS: recommendations finished on 7000/10000 queries. users per second: 1593.36
PROGRESS: recommendations finished on 8000/10000 queries. users per second: 1600.14
PROGRESS: recommendations finished on 9000/10000 queries. users per second: 1596.26
PROGRESS: recommendations finished on 10000/10000 queries. users per second: 1587.42



In [45]:

    
rec_songs.head()









    Out[45]:





    
user_id	song	score	rank
c067c22072a17d33310d7223d7b79f819e48cf42 ...	Grind With Me (Explicit Version) - Pretty Ricky ...	0.0459424433009	1
696787172dd3f5169dc94deef97e427cee86147d ...	Senza Una Donna (Without A Woman) - Zucchero / ...	0.0170265780731	1
532e98155cbfd1e1a474a28ed96e59e50f7c5baf ...	Jive Talkin' (Album Version) - Bee Gees ...	0.0118288659232	1
18325842a941bc58449ee71d659a08d1c1bd2383 ...	Goodnight And Goodbye - Jonas Brothers ...	0.0168060865646	1
507433946f534f5d25ad1be302edb9a2376f503c ...	Find The Cost Of Freedom - Crosby_ Stills_ Nash & ...	0.0165806601546	1
18fafad477f9d72ff86f7d0bd838a6573de0f64a ...	Rabbit Heart (Raise It Up) - Florence + The ...	0.0799450902285	1
fe85b96ba1983219b296f6b4869dd29eb2b72ff9 ...	Secrets - OneRepublic	0.079137043826	1
225ea420b4bede50919d1bfe24a599691522d176 ...	Alejandro - Lady GaGa	0.0273359193626	1
95dc7e2b188b1148b2d25f4e6b6e94afacc4efc3 ...	Bust a Move - Infected Mushroom ...	0.0534825732628	1
4a3a1ae2748f12f7ab921a47d6d79abf82e3e325 ...	Isis (Spam Remix) - Alaska Y Dinarama ...	0.0418030208549	1
    

[10 rows x 4 columns]



In [47]:

    
song_count = rec_songs.groupby(key_columns='song', operations={'count': graphlab.aggregate.COUNT()})



In [49]:

    
song_count.sort('count', ascending=False)









    Out[49]:





    
song	count
Undo - Björk	432
Secrets - OneRepublic	374
Revelry - Kings Of Leon	233
You're The One - Dwight Yoakam ...	165
Fireflies - Charttraxx Karaoke ...	118
Hey_ Soul Sister - Train	106
Horn Concerto No. 4 in E flat K495: II. Romance ...	92
Sehr kosmisch - Harmonia	85
OMG - Usher featuring will.i.am ...	63
Clocks - Coldplay	49
    

[3142 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
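
The counting step can also be pictured without SFrames. The sketch below tallies how often each song appears as a user's top recommendation and picks the most frequent one. The `top_recs` rows are made up for illustration and stand in for the `rec_songs` table.

```python
from collections import Counter

# Hypothetical top-1 recommendations in the shape (user_id, song).
top_recs = [
    ("u1", "song A"), ("u2", "song B"), ("u3", "song A"),
    ("u4", "song C"), ("u5", "song A"), ("u6", "song B"),
]

# Count how many users received each song as their top recommendation.
song_counts = Counter(song for _, song in top_recs)
most_recommended, times = song_counts.most_common(1)[0]
```

In the notebook output above, the corresponding answer is "Undo - Björk", which was the top recommendation for 432 of the 10,000 sampled test users.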


